Accelerator Architecture
for Sparse High-Order
Tensor Contraction

Gabriel Kulp, Andrew Ensinger
June 7, 2023

Introduction

FLAASH: Flexible Accelerator Architecture
for Sparse High-Order Tensor Contraction

Gabriel Kulp, Andrew Ensinger, Lizhong Chen

Paper submitted to ICML two weeks ago

Roadmap

  1. Introduction (you are here)
  2. Background and Motivation
    • What is sparse tensor contraction?
    • What are accelerators?
  3. Methods and Results
    • What did we do?
    • Did it work?
  4. Future Work
    • Where can we collaborate?

Summary

  • We designed an architecture in simulation
  • It contracts sparse high-order tensors
  • We also implemented it in Verilog
  • Benchmarks show a ~25x speedup over the software baseline

Background

Things I want to teach you

  • What kind of math is this?
  • Why would anyone want that?
  • What have others done before us?

What is a tensor?

  • Multi-dimensional array
  • Represents a multilinear transform
  • Many applications

Vocab

  • Mode: single dimension/direction
  • Order: how many modes
  • Fiber: a 1-D slice (row, column, etc.)
  • Coordinate: position along every mode (which row, which column, …)
  • Volume: product of mode lengths
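The vocabulary maps directly onto a small NumPy example (a hypothetical order-3 tensor, not one from the paper):

```python
# Illustrating mode, order, fiber, coordinate, and volume on a toy tensor.
import numpy as np

T = np.arange(24).reshape(2, 3, 4)  # an order-3 tensor

order = T.ndim          # number of modes: 3
mode_lengths = T.shape  # length of each mode: (2, 3, 4)
volume = T.size         # product of mode lengths: 24
fiber = T[0, 1, :]      # a fiber along mode 2: fix every coordinate but one
coordinate = (0, 1, 2)  # one position along every mode
element = T[coordinate] # the single entry that coordinate addresses
```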

High-Order? Sparse?

“High-order” ≈ more than 3 modes

“Sparse” ≈ most entries are zero

Tensor contraction

  • Matrix multiplication but general
  • Choose one mode from each operand, then take the dot product of every pair of fibers along those modes
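A minimal sketch of “matrix multiplication but general,” using NumPy on hypothetical dense operands (the accelerator works on sparse data, but the math is the same):

```python
# Contract mode 2 of A against mode 0 of B: one dot product per output entry.
import numpy as np

A = np.random.rand(2, 3, 4)  # order-3 operand
B = np.random.rand(4, 5)     # order-2 operand

C = np.einsum('ijk,kl->ijl', A, B)  # contract along the shared mode (length 4)

# Same result with explicit loops, to show the "dot everything" structure.
C2 = np.zeros((2, 3, 5))
for i in range(2):
    for j in range(3):
        for l in range(5):
            C2[i, j, l] = np.dot(A[i, j, :], B[:, l])

assert np.allclose(C, C2)
```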

What is an accelerator?

  • CPUs are general
  • GPUs are more specific
  • GPUs are accelerators
  • Accelerators do something specific
  1. CPU gives the problem to the accelerator
  2. Accelerator solves it
  3. Results sent back to CPU

Challenges

  1. Control flow
  2. Memory access
  3. Workload balance

Prior Work

  • Focuses on 2D data only
  • Focuses on structured sparsity
  • Only targets software

Focuses on 2D data only

  • Y. Wang, C. Zhang, Z. Xie, C. Guo, Y. Liu, and J. Leng
  • C. Tillinghast and S. Zhe
  • J. Liu, J. Ren, R. Gioiosa, D. Li, and J. Li

Focuses on structured sparsity

  • Z. Chen, Z. Qu, L. Liu, Y. Ding, and Y. Xie
  • J. Yue, et al. (STICKER-IM)

Only targets software

  • F. Kjolstad, S. Kamil, S. Chou, D. Lugato, and S. Amarasinghe
  • Z. Du, Y. Guan, T. Guan, D. Niu, L. Huang, H. Zheng, and Y. Xie

Methods

Overview

Four parts

  1. Tensor Memory
  2. Job Generation and Dispatch
  3. Processing Elements (“workers”)
  4. Input/Output Interface

Tensor Memory

  • Convert coordinates to pointers
  • Manage data structures
    • Bookkeeping
    • Dynamic maintenance
  • Manage caching
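To illustrate the coordinate-to-pointer role, here is a sketch using a sorted-COO layout; this is a simplification for exposition, not the CSF structure the design actually uses:

```python
# Hedged sketch: coordinate -> storage-offset lookup over sorted nonzeros.
from bisect import bisect_left

coords = [(0, 0, 1), (0, 2, 3), (1, 1, 0), (1, 2, 2)]  # sorted coordinates
values = [5.0, 7.0, 2.0, 9.0]                          # matching nonzero values

def lookup(coord):
    """Return an offset (a "pointer") into values, or None if the entry is zero."""
    i = bisect_left(coords, coord)
    if i < len(coords) and coords[i] == coord:
        return i
    return None

assert values[lookup((1, 1, 0))] == 2.0  # present entry
assert lookup((0, 1, 1)) is None         # implicit zero
```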

Job Generation and Dispatch

  • List all dot products (on-demand)
  • Queue them up
  • Send to workers
  • Wait until queue is empty and workers are done
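The dispatch steps above can be sketched as a generator that lists dot-product jobs on demand; the shapes and names here are hypothetical, not taken from the design:

```python
# Each job names one output coordinate, i.e. one left fiber paired with one
# right fiber whose dot product produces that entry.
from itertools import product

left_shape = (2, 3, 4)   # contract the last mode of the left operand
right_shape = (4, 5)     # against the first mode of the right operand

def jobs():
    for i, j in product(range(left_shape[0]), range(left_shape[1])):
        for l in range(right_shape[1]):
            yield ((i, j), l)  # output coordinate (i, j, l)

queue = list(jobs())  # workers pop jobs from this queue until it drains
```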

Sparse Dot Product Engines

Five parts:

  1. Left fiber loader
  2. Right fiber loader
  3. Intersection and MAC
  4. Storage queue
  5. Job queue

Worker function

  • Fetch fibers from operands
  • “Zip” together the fibers
  • Seek “intersections” where index matches
  • Multiply and accumulate
  • Send result to memory when done
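The zip/intersect/MAC loop can be sketched in Python as a two-pointer walk over sorted (index, value) pairs; this is a software model of the worker's inner loop, not the Verilog implementation:

```python
# "Zip" two sparse fibers, MAC only where their indices intersect.
def sparse_dot(left, right):
    i = j = 0
    acc = 0.0
    while i < len(left) and j < len(right):
        li, lv = left[i]
        rj, rv = right[j]
        if li == rj:      # intersection found: multiply and accumulate
            acc += lv * rv
            i += 1
            j += 1
        elif li < rj:     # advance whichever fiber is behind
            i += 1
        else:
            j += 1
    return acc

left  = [(0, 2.0), (3, 4.0), (7, 1.0)]
right = [(3, 0.5), (5, 9.0), (7, 2.0)]
assert sparse_dot(left, right) == 4.0 * 0.5 + 1.0 * 2.0  # indices 3 and 7 match
```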

Input/Output Interface

  • Handle allocation and transfer of tensors
  • Provide connection to physical memory
    • Queuing and batching for DRAM
    • Or DMA to the host

Results

How did I get results?

  • Wrote accelerator in SystemC HLS
  • Wrote driver and library in Python
  • Compared simulated performance to software
  • Baseline is the PyData Sparse library

How did it do?

  • Linear time complexity as expected

Conclusion

Summary

  1. Pick a niche problem that’s useful
  2. Design architecture that fixes existing shortcomings
  3. Begin to evaluate with a simulation

Future work

  • Optimize jobs for cache residency
  • Use a hash tree instead of CSF (compressed sparse fiber)
  • Integrate with numpy and/or PyTorch

Thank You!

Questions?
